Abstract: With the advent of large astronomical facilities, the traditional development model
for data reduction faces problems such as redundant programs and conflicting environment
dependencies. Moreover, because a cluster is a highly coupled computing resource, serious
environment conflicts can make the entire cluster unavailable. To address this problem, we have
developed a new pipeline framework based on the concept of microservices. This paper presents the
ONSET data pipeline developed with this framework. To achieve near-real-time data
processing, we optimize the core program using MPI and GPU technologies and evaluate the final
performance. The results show that this development model can deliver a pipeline that meets the
requirements in a short time, and we believe that it has implications for
future multi-band and multi-terminal astronomical data processing.
1. Introduction


Large astronomical facilities have brought astronomy into the era of big data. How to process
these data has become a major problem for the astronomical community, and the data pipeline is
one of the key means by which astronomers make sense of the data.

Normally, a data pipeline is developed in either a monolithic or a modular way. The pros and
cons of these two approaches have been discussed for years, and modular design has been the
mainstream for data pipelines so far. In addition, deploying a data pipeline is often hard work,
because dependency and other problems may arise depending on the operating system of the
platform.

To tackle this situation, we apply a new framework based on virtualization technology to the
development of the data pipeline for ONSET at FSO. Some details of this new framework are
discussed in this paper. Section 2 introduces the basic information of ONSET, and Section 3
presents some details of the data pipeline based on the new framework. Sections 4 and 5 show the
results of using CUDA for near-real-time data processing. Finally, we briefly discuss some
possible improvements to the data pipeline for ONSET.
2. Overview of ONSET


The Fuxian Lake Solar Observatory (FSO) is located in Yunnan, China. FSO has two telescopes, the
Optical and Near-infrared Solar Eruption Tracer (ONSET) and the 1 m New Vacuum Solar
Telescope (NVST). ONSET was jointly developed by Yunnan Observatory and Nanjing Institute of
Astronomical Optics Technology, and funded by Nanjing University. Its effective diameter is
275 mm, and it observes the Sun in three wavelength bands: He I 10830 Å, Hα, and white light at
3600 Å and 4250 Å, as shown in Tab.1. The corona, chromosphere and photosphere can be observed
simultaneously over the full solar disk or a partial disk with a field of 10". At present, there
are two acquisition cameras, namely Andor Neo and Flash4.0 V3. Neo's conventional observation mode
is to collect 10 sets of full-disk or partial-disk images of the Sun every minute, with an image
size of 852×852; Flash's regular observation mode is to collect 10 sets of full-disk or
partial-disk images of the Sun every 30 seconds, with an image size of 1700×1700. Each group
contains 150 frames. The data volume is shown in Tab.2.

Since ONSET and NVST data follow similar processing flows, the pipeline framework we designed
can also be used to develop the ONSET pipeline. This has the advantage of reducing
repetitive work and accelerating pipeline deployment. Next we describe the development of the
ONSET pipeline.


Table 1 Channels of ONSET


Channels         Wavelength (Å)
Hα               6562.8 ± 5, tunable
White light      3600, 4250
Near infrared    10830 ± 0.85, tunable

Table 2 Sampling rate of ONSET


Camera       Data (8 h/day)   Data volume (200 days)
Andor Neo    65 GB            12.7 TB
Flash4.0     258 GB           50.4 TB
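
The 200-day totals in Tab.2 follow from the daily volumes if one terabyte is counted as 1024 GB. The short check below makes that arithmetic explicit; the binary-unit convention is our assumption, chosen only because it reproduces the tabulated figures, and the function name is illustrative.

```python
# Reproduce the 200-day data volumes of Tab.2 from the daily volumes.
# Assumption (not stated in the text): 1 TB is counted as 1024 GB.
DAILY_GB = {"Andor Neo": 65, "Flash4.0": 258}  # 8 h of observation per day
OBSERVING_DAYS = 200


def total_tb(daily_gb, days=OBSERVING_DAYS):
    """Return the accumulated volume in TB (1 TB = 1024 GB)."""
    return daily_gb * days / 1024


for camera, daily_gb in DAILY_GB.items():
    print(f"{camera}: {total_tb(daily_gb):.1f} TB over {OBSERVING_DAYS} days")
```

With these inputs the two cameras come out at about 12.7 TB and 50.4 TB, matching the table.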


3. Data pipeline for ONSET 


At present, data products of ONSET are classified into three levels:
• Level 0 data: raw, unprocessed data collected by the cameras.
• Level 0.5 data: Level 0 data calibrated by flat-field and dark-field processing, followed by
frame selection and reconstruction by speckle interferometry and speckle masking.
• Level 1 data: Level 0.5 data with headers added according to the conditions of the observation
and saved in FITS format. Level 1 data are science-ready and can be used for scientific
research.
The Transfer Node is responsible for saving the raw data to the storage device and sharing the
mount point with the computing center via a high-performance network (10GbE). The raw data are
then processed into science-ready data by the computing center. Finally, the science-ready data
are archived through the Distribution Node, as shown in Fig.1.
Fig.1. Diagram of ONSET hardware structure


In October 2020, ONSET decided to develop a new data pipeline. The requirements of this new
pipeline are as follows:

1) The speckle masking method is used to process all the data acquired at ONSET;

2) At the beginning of development, ONSET provides a Dell R730 server
(56 cores + K40 GPU) for testing; ultimately, the 8 h of daily observations must be processed
within 3-5 days after observation;

3) The data pipeline is developed in Python;

4) The quality of the science-ready data is assessed by the scientists from ONSET.

3.1 Algorithm


At present, speckle masking and speckle interferometry are the most widely used methods in
high-resolution solar image reconstruction. Speckle interferometry is used to reconstruct the
amplitude, and speckle masking reconstructs the phase. The high-resolution images reconstructed
with this algorithm are termed Level 0.5 data at ONSET; they become science-ready (Level 1) data
once headers are added and they are saved in FITS format. The algorithm flow chart is shown in
Fig.2.
Fig.2. Level 0.5 algorithm of ONSET


3.2 Pipeline configuration file 

The pipeline configuration file lets users define the pipeline structure in YAML format. YAML
is a highly readable data-serialization format and is well suited to writing configuration
files. The keywords of the pipeline configuration file are shown in Tab.3.

The data exchange between microservices in the data pipeline is based on ZeroMQ's
message-forwarding mechanism, so users need to specify, in the pipeline configuration file, the
connections between microservices, the ports to bind and the required resources. The Pipeline
Parser then generates an executable script from the pipeline configuration file, supports
scaling and submits the tasks automatically.


Table 3 Pipeline configuration file keywords


Keywords       Value          Comment
Pipeline       "*appname"     microservice name
               "nodes"        number of nodes required
               "ppn"          number of processors on a single node
               "mem"          required memory
               "scheduler"    resource scheduler (PBS/SLURM)
parameters     "workdir"      working directory
               "savedir"      saving directory
               [illegible]    path of the speckle imaging transfer function
               "script"       path of the script created by the Pipeline Parser
*app           "name"         name of the app
               "volume"       host disks to mount into Singularity
               "sif"          path of the Singularity image
               "port"         mapped port
               "retrieval"    find files of interest, implemented with the glob module in Python
               "branch"       core functions
               "archive"      path of processed data
               "request"      requested resources
               "mpi"          whether to use MPI
[illegible]    "loc"          expression
               "notin"        filter
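
To make the role of the Pipeline Parser more concrete, the sketch below turns one microservice entry into a batch script. It is an illustration under stated assumptions, not the framework's actual parser: the keyword names come from Tab.3, but the nesting, the `render_job` function, the paths, the ports, the memory unit and the `core1.py` entry point are all hypothetical, and a plain dictionary stands in for the YAML file so that only the standard library is needed.

```python
# Hypothetical pipeline description using keyword names from Tab.3.
# The real framework reads this from a YAML file; a dict is used here
# so that the sketch runs with the standard library alone.
CONFIG = {
    "Pipeline": {
        "core1": {
            "nodes": 1,
            "ppn": 20,
            "mem": 64,                          # GB (unit is an assumption)
            "scheduler": "SLURM",
            "sif": "/opt/images/core1.sif",     # hypothetical image path
            "volume": "/data:/data",            # host:container mount
            "script": "core1.py",               # hypothetical entry point
            "port": {"in": 5557, "out": 5558},  # hypothetical ports
        }
    }
}


def render_job(appname, app):
    """Render a batch script for one microservice (illustrative only)."""
    run = (f"singularity exec --nv --bind {app['volume']} "
           f"{app['sif']} python {app['script']}")
    scheduler = app["scheduler"].upper()
    if scheduler == "SLURM":
        header = [
            f"#SBATCH --job-name={appname}",
            f"#SBATCH --nodes={app['nodes']}",
            f"#SBATCH --ntasks-per-node={app['ppn']}",
            f"#SBATCH --mem={app['mem']}G",
        ]
    elif scheduler == "PBS":
        header = [
            f"#PBS -N {appname}",
            f"#PBS -l nodes={app['nodes']}:ppn={app['ppn']}",
            f"#PBS -l mem={app['mem']}gb",
        ]
    else:
        header = []  # standalone: no scheduler directives
    return "\n".join(["#!/bin/bash", *header, run]) + "\n"


script = render_job("core1", CONFIG["Pipeline"]["core1"])
print(script)
```

Keeping the scheduler-specific directives in one function means the same configuration entry can target PBS, Slurm or a standalone run by changing only the "scheduler" value.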


3.3 Containerization of microservices 

The most important concept in our new framework is microservices. The term originated in the
Internet industry and refers to applications that each handle specific requests and are
relatively independent of each other. One benefit of microservices-based development is that
complex functional relationships are decoupled into multiple subfunctions, each with a single
purpose and easy to maintain, so that the number of corresponding microservices can be
dynamically increased or decreased to achieve scaling when the load changes. More and more
programs have moved from the host or virtual machine environment to containers as their carrier.
After some research we chose Singularity, a high-performance container technology developed by
Lawrence Berkeley National Laboratory specifically for large-scale, cross-node high-performance
computing (HPC) and deep learning (DL) workloads.

Our framework combines the concept of microservices with pipeline functions wrapped in
Singularity. Each microservice is one function of the data reduction, wrapped in a Singularity
image.


Besides other known advantages, this framework increases the performance of the pipeline
through scaling based on a message-passing model. In theory, a GPU has more computing power than
a CPU, but achieving 100% GPU performance requires proficiency in grid computing and other
specialized GPU programming knowledge. To release the power of the GPU, our solution is to use a
message preemption mechanism, which increases GPU resource utilization through scaling, by means
of peripheral acceleration, and significantly reduces data processing time. As long as the
graphics memory is not exceeded, the data processing speed is linearly related to the number of
microservices, as shown in Fig.3. This significantly improves GPU resource utilization when
utilization would otherwise be low.
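
One way to picture this scaling is the competing-consumers pattern: several identical workers take messages from a shared inbox, and whichever worker is free takes the next one. The sketch below is our reading of the mechanism rather than the framework's code; threads and `queue.Queue` stand in for containerized microservices and ZeroMQ sockets, and squaring a number stands in for the GPU work.

```python
import queue
import threading


def run_workers(tasks, n_workers):
    """Process tasks with n_workers competing consumers.

    Each worker takes the next message as soon as it is free, so a faster
    worker naturally handles more messages. Returns {worker_id: [results]}.
    """
    inbox = queue.Queue()
    for task in tasks:
        inbox.put(task)
    done = {w: [] for w in range(n_workers)}

    def worker(worker_id):
        while True:
            try:
                task = inbox.get_nowait()        # claim the next message
            except queue.Empty:
                return                           # no work left
            done[worker_id].append(task * task)  # placeholder for GPU work

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


results = run_workers(range(100), n_workers=4)
processed = sorted(r for chunk in results.values() for r in chunk)
print(len(processed), "tasks handled by", len(results), "workers")
```

Every task is processed exactly once regardless of how many workers are started, which is what allows the number of microservice instances to be raised or lowered without changing the rest of the pipeline.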

The carrier of microservices is the container. Introducing container technology into the
pipeline solves the environment dependency problem that has long plagued developers and further
increases portability. The data pipeline consists of two modules:
• Coordinator: prepares data for the other microservices.
• Pipeline Parser: parses the pipeline configuration file, turns it into a PBS, Slurm or
standalone script, and submits the task.
Fig.3. Utilization rate of the GPU


We decouple the pipeline into the following microservices:
Dark: dark-field correction of the raw data.
Flat: flat-field correction of the raw data.
matching: matching the data with the off-band data.
core1: Level 1 algorithm for 6563 Å.
core2: Level 1 algorithm for 3600 Å and 4250 Å.


The pipeline structure diagram based on containers is shown in Fig.4.
Fig.4. The pipeline structure diagram based on the container


Microservices bind ports through the Pipeline API; the main program is shown in Tab.4.
Table 4 Code implementation of port binding through the Pipeline API


import Pipeline_API as PA  # basic library for HPVDP


def AppProcess(pipefile):
    # Convert the YAML pipeline configuration file to a dictionary
    pipeline = PA.read_pipe(pipefile)
    port_in = pipeline["Pipeline"]["*appname"]["port"]["in"]    # input port
    port_out = pipeline["Pipeline"]["*appname"]["port"]["out"]  # output port
    recv = PA.bindport(port_in, "in")    # bind the input port
    send = PA.bindport(port_out, "out")  # bind the output port


4. GPU Processing with mpi4py


A large number of short-exposure images containing high-frequency information are taken during
every observation at ONSET. We adopt the speckle masking method to reconstruct the phase of the
image and speckle interferometry to reconstruct its amplitude. The speckle masking method
involves the calculation of a four-dimensional bispectrum and phase recursion, which are very
time-consuming. Hence we use GPU and MPI technology to reduce the processing time of the
program. The flow chart of the core algorithm is shown in Fig.5.
Fig.5. Flow chart of core algorithm based on MPI and GPU
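
For readers unfamiliar with the method, the standard textbook relations behind the two reconstruction steps can be written as follows. The notation is ours and is not taken from the ONSET code: $I_k$ is the Fourier transform of the $k$-th short-exposure frame, $O$ the object spectrum, $S_k$ the instantaneous transfer function, $\varphi$ the object phase, and $\langle\cdot\rangle$ an average over frames.

```latex
% Speckle interferometry: amplitude from the averaged power spectrum
|O(\mathbf{u})|^2 =
  \frac{\langle |I_k(\mathbf{u})|^2 \rangle}{\langle |S_k(\mathbf{u})|^2 \rangle}

% Speckle masking: the bispectrum depends on two 2-D frequencies, hence 4-D
B(\mathbf{u},\mathbf{v}) =
  \langle I_k(\mathbf{u})\, I_k(\mathbf{v})\, I_k^{*}(\mathbf{u}+\mathbf{v}) \rangle

% Phase recursion: phase at u+v from the phases at lower frequencies
\varphi(\mathbf{u}+\mathbf{v}) =
  \varphi(\mathbf{u}) + \varphi(\mathbf{v}) - \arg B(\mathbf{u},\mathbf{v})
```

The bispectrum has to be accumulated over many frequency pairs for every sub-block, and the recursion visits frequencies in order, which is consistent with these two steps dominating the run time.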


Within the ONSET program, the whole flow contains these steps: image preprocessing, initial
image alignment, seeing estimation, calculation of the speckle interferometry transfer function,
image block processing, and image stitching. The most time-consuming step in the entire program
is the image block processing. After analysis, we decided to use the CUDA architecture of Nvidia
GPUs to execute block processing in parallel. While improving GPU utilization, we also found
that CPU utilization was low. To improve CPU utilization, the MPI programming model is adopted.
Fig.6 shows common communication methods of MPI.
(a) comm.scatter(data, root=0)
(b) comm.bcast(data, root=0)
(c) comm.gather(data, root=0)


Fig.6. Communication categories of mpi4py


In the MPI programming model, an important step is to scatter tasks to each node. In our case,
this means scattering the image sub-block processing tasks across all nodes. The master node
divides the aligned observation images into sub-block groups and assigns the sub-block
calculation tasks to the slave nodes. Both the master and the slave nodes carry out sub-block
processing, and all their results are returned to the master node. However, MPI has a limitation
that impedes the parallelization of data chunks with 2^31 = 2147483648 or more elements: the MPI
standard uses a C int for the element count, which has a maximum positive value of 2^31 - 1, and
the function reports an error when an object is bigger than that. To avoid this, we divide the
data into two parts that are processed separately. To allocate tasks reasonably, the master node
first calculates the length of the task list and divides it by the number of cores involved in
the calculation. If the length is divisible, we use the scatter function to send the tasks
evenly to each slave node. If it is not, we pad the task list with empty arrays until it is
divisible, and then use the scatter function to send it evenly to each slave node. The
subsequent calculations then start, and each node returns its part of the result to the master
node.


Algorithm 1 Strategy for scatter


procedure DIVIDE-TASKS(tasks_list, cores)
    task1_list ← None
    task2_list ← None
    if comm_rank == 0 then
        task1_list ← tasks_list[: int(1/2 × len(tasks_list))]
        task2_list ← tasks_list[int(1/2 × len(tasks_list)) :]
        if len(task1_list) % cores != 0 then
            Padding(task1_list, 0, cores)
        end if
        if len(task2_list) % cores != 0 then
            Padding(task2_list, 0, cores)
        end if
    end if
    thread_tasks ← scatter(task1_list)
    result_mid1 ← StartReconstructed(thread_tasks)
    result_final ← gather(result_mid1)
    thread_tasks ← scatter(task2_list)
    result_mid2 ← StartReconstructed(thread_tasks)
    result_mid3 ← gather(result_mid2)
    APPEND(result_final, result_mid3)
    return result_final
end procedure
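
The padding step can be sketched in plain Python without MPI. The function below is illustrative only: `pad_and_split` and the `filler` marker are our names, not those of the pipeline, and the example sizes assume the 2704 sub-blocks per set and 20 cores used in the tests of Section 5. It produces a list with exactly one equal-sized chunk per core, which is the shape that mpi4py's `comm.scatter(chunks, root=0)` expects on the root rank.

```python
def pad_and_split(tasks, cores, filler=None):
    """Pad `tasks` to a multiple of `cores` and cut it into `cores` chunks.

    The returned list has exactly `cores` entries of equal length.
    `filler` marks padding entries so they can be dropped after the gather.
    """
    remainder = len(tasks) % cores
    padded = list(tasks)
    if remainder:
        padded += [filler] * (cores - remainder)
    per_core = len(padded) // cores
    return [padded[i * per_core:(i + 1) * per_core] for i in range(cores)]


blocks = list(range(2704))                    # one set: 2704 sub-blocks
half = len(blocks) // 2
first, second = blocks[:half], blocks[half:]  # two halves, scattered in turn
chunks = pad_and_split(first, cores=20)
print(len(chunks), len(chunks[0]), chunks[-1].count(None))
```

For a half of 1352 sub-blocks on 20 cores, eight padding entries are appended and each core receives 68 items; the padding ends up in the last chunk and should be discarded once the results have been gathered.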


5. Result 


To test our data pipeline, we processed white-light (WH) data in both CPU+GPU mode and CPU
mode. The test environment is listed here:

Hardware: Intel(R) Xeon(R) Gold 5115 CPU @ 2.40GHz ×2, 128GB RAM, GeForce RTX
2080Ti 11GB ×4;

Software: Ubuntu 16.04.6 LTS, CUDA 10.1, GPU driver 430.40, Singularity 3.5.0, OpenMPI
1.10.7, Python 3.6.10.

Ten sets of 50 frames (1700×1700 pixels) of WH-band data taken on October 10, 2020 were
selected, and each frame was chunked in an overlapping fashion, giving 2704 blocks
(96×96 pixels) in each set. The execution time is averaged over the ten sets of data. The
experiments show that CPU+GPU mode improves program execution speed compared with CPU mode; the
results are shown in Tab.5. The introduction of MPI and GPU not only brings performance
improvement but also costs some time for data preparation and MPI initialization. The
statistics are shown in Tab.6. They show that the small amount of data transferred between the
CPU and the GPU does not incur significant performance costs, whereas using MPI to collect data
and initialize parameters imposes a significant performance overhead.


Table 5 Optimization results of the two methods


Pattern                         Method     Elapsed time (s)
Serial                          CPU        4607
                                CPU+GPU    3150
Parallel (MPI with 20 cores)    CPU        477
                                CPU+GPU    360


Table 6 Elapsed time of data preparation and MPI initialization (CPU+GPU with 20 cores)


Item                                                     Elapsed time (s)
Data transmission between host (CPU) and device (GPU) in the pipeline:
    Raw data (host to device)                            1.02
    Data for bispectrum and modulus (host to device)     0.4
    Data for phase recursion (device to host)            0.39
Initial MPI                                              9
Gather data by MPI                                       40
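
The speedups implied by Tab.5 and the weight of the MPI overhead in Tab.6 are not stated explicitly, so the short calculation below derives them from the tabulated values. Only numbers from the two tables are used; the variable names are ours.

```python
# Elapsed times (s) from Tab.5 and MPI overheads (s) from Tab.6.
ELAPSED = {
    ("serial", "CPU"): 4607,
    ("serial", "CPU+GPU"): 3150,
    ("mpi20", "CPU"): 477,
    ("mpi20", "CPU+GPU"): 360,
}
MPI_INIT, MPI_GATHER = 9, 40

# Speedup of every configuration relative to the serial CPU run.
baseline = ELAPSED[("serial", "CPU")]
speedup = {key: baseline / t for key, t in ELAPSED.items()}
for (pattern, method), s in speedup.items():
    print(f"{pattern:7s} {method:8s} speedup = {s:.2f}x")

# Share of the fastest run spent on MPI initialization and gathering.
mpi_share = (MPI_INIT + MPI_GATHER) / ELAPSED[("mpi20", "CPU+GPU")]
print(f"MPI init + gather: {mpi_share:.1%} of the 360 s run")
```

Relative to the serial CPU run, the GPU alone gives roughly 1.5x, MPI on 20 cores roughly 9.7x, and the combination roughly 12.8x, while MPI initialization and gathering account for about 14% of the 360 s run.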


To test the efficiency of the optimization, we measured the effect of the number of processes
on execution efficiency in GPU+CPU mode and in CPU mode. The results show that the running
efficiency is less sensitive to the number of processes in GPU+CPU mode than in CPU mode, as
shown in Fig.7. To show the reconstruction results, we selected HA-band and WH-band data; the
results are shown in Fig.8.
Fig.7. Acceleration ratio (left) and execution time (right) in CPU and CPU+GPU mode with
different numbers of processes
Fig.8. Original images (left) vs reconstructed images (right)


6. Conclusion & Discussion


In this article, we developed a prototype data pipeline for ONSET using a container-based
pipeline framework and optimized the original CPU program with the GPU, resulting in a
significant speedup. We plan to deploy the data pipeline in the near future. ONSET's GPU server
contains a Tesla K40m GPU and a Xeon(R) E5-2680 v4 CPU @ 2.4GHz (56 cores). In this environment
it takes 53 s to reconstruct a set of HA-band images and 250 s to reconstruct a set of WH-band
images, so a day's worth of data (8 h) takes about 14 h (HA) and 66 h (WH) to process,
respectively. The timing requirement for data reduction at ONSET is therefore met.
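
As a consistency check on these figures, the calculation below works backwards from the quoted daily totals to the number of sets they imply and compares the totals with the lower bound of the 3-5 day requirement. The implied set count is our inference from the quoted numbers, not a value given by ONSET.

```python
# Per-set reconstruction times (s) and quoted daily totals (h) from the text.
SECONDS_PER_SET = {"HA": 53, "WH": 250}
QUOTED_HOURS = {"HA": 14, "WH": 66}
DEADLINE_HOURS = 3 * 24  # lower bound of the 3-5 day requirement


def implied_sets(band):
    """Number of sets per day implied by the quoted totals (an inference)."""
    return QUOTED_HOURS[band] * 3600 / SECONDS_PER_SET[band]


for band in SECONDS_PER_SET:
    within = QUOTED_HOURS[band] <= DEADLINE_HOURS
    print(f"{band}: about {implied_sets(band):.0f} sets/day, "
          f"{QUOTED_HOURS[band]} h within {DEADLINE_HOURS} h: {within}")
```

Both bands imply roughly 950 sets per day, close to the 960 sets that one set every 30 s would give over 8 h, and even the 66 h needed for the WH band stays below 72 h.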

In the whole image reconstruction at ONSET, the multiple layers of conditional judgements and
loops required by the image phase reconstruction are the most time-consuming part. This kind of
logical operation is executed more efficiently by the CPU than by the GPU, so we are now working
on how to speed up this process on the CPU. In general, further improvements to the algorithm
are necessary to enable near-real-time processing.

In practice, we found that this development model allows flexible modification of the
microservices, so that required functionality can be added to the pipeline in time without
modifying the entire program. We believe that this container-based development model will save
a lot of time in both the development and the deployment of astronomical data pipelines, as
astronomical facilities are now moving towards multi-terminal and multi-wavelength operation.
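ONSET Data Pipeline
Abstract: As large astronomical facilities come into service, the traditional development model
faces problems such as duplicated program development and conflicting environment dependencies.
In addition, a cluster is a highly coupled computing resource, and serious environment conflicts
may make the entire cluster unavailable. To solve this problem, we have developed a new pipeline
framework based on the concept of microservices, with which new pipelines can be developed and
deployed in a short time. This paper presents the ONSET data pipeline developed with this
framework. To achieve near-real-time data processing, we optimized the core program with MPI and
GPU technologies and evaluated the final performance. The results show that this development
model can build a pipeline that meets the requirements in a short time, and we believe that it
offers a useful reference for future multi-band, multi-terminal astronomical data processing.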